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Robust Sparse Canonical Correlation Analysis 


Ines Wilms and Christophe Croux 


Abstract: Canonical correlation analysis (CCA) is a multivariate statistical method which describes the 
associations between two sets of variables. The objective is to find linear combinations of the variables in 
each data set having maximal correlation. This paper discusses a method for Robust Sparse CCA. Sparse 
estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their 
interpretability is improved. We also robustify the method such that it can cope with outliers in the data. 
To estimate the canonical vectors, we convert the CCA problem into an alternating regression framework, 
and use the sparse Least Trimmed Squares estimator. We illustrate the good performance of the Robust 
Sparse CCA method in several simulation studies and two real data examples. 

Keywords: Canonical correlation analysis, penalized regression, robust regression, sparse Least Trimmed 
Squares 


1 Introduction 


Canonical correlation analysis (CCA), introduced by Hotelling (19361, identifies and quantifies the associ¬ 
ations between two sets of variables. CCA searches for linear combinations, called canonical variates, of 
each of the two sets of variables having maximal correlation. The coefficients of these linear combinations 
are called the canonical vectors. The correlations between the canonical variates are called the canonical 


correlations. For more information on canonical correlations analysis, see e.g. Johnson and Wichern (1998 
Chapter 10). 

Sparse canonical vectors are canonical vectors with some of their elements estimated as exactly zero. The 
canonical variates then only depend on a subset of the variables, those corresponding to the non-zero elements 
of the estimated canonical vectors. Hence, the canonical variates are easier to interpret, in particular for 
high-dimensional data sets. Examples of CCA for high-dimensional data sets can be found in, for example. 


genetics (Gonzalez et al., 2008 Prabhakar and Fridley 2012 Cruz-Cano and Lee 2014) and machine learning 


(Sun et al. 2011). 


Different approaches for sparse CCA have been proposed in the literature. Parkhomenko et al. (2009) 


use a sparse singular value decomposition to derive sparse singular vectors. Witten et al. (2009) develop a 
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penalized matrix decomposition, and show how to apply it for sparse CCA. Waaijenborg et al. (2008), Lykou 


and Whittaker (2010), and Wilms and Croux (2013) convert the CCA problem into a penalized regression 


framework to produce sparse canonical vectors. All these methods are not robust to outliers. A common 
problem in multivariate data sets, however, is the frequent occurrence of outliers. Therefore, the possible 
presence of outliers should be taken into account. 


Several robust CCA methods have been introduced in the literature. Karnel (1991) considers robust 


CCA using an M-estimator of multivariate location and scatter; Dehon and Croux (2002) use the minimum 


covariance determinant (MCD, Rousseeuw, 1985) estimator. Asymptotic properties for CCA based on robust 


estimators of the covariance matrix are discussed in Taskinen et al. (2006). Filzmoser et al. (2000) use a 


robust alternating regression approach, following the approach of Wold (1968), to obtain the first canonical 


variates. Branco et al. (2005) extend the method of Filzmoser et al. (2000) to obtain all canonical variates. 
CCA can also be considered as a prediction problem, where the canonical variates obtained from the first 
data set serve as optimal predictors for the canonical variates of the second data set, and vice versa. As such. 


Adrover and Donato (2015) use a robust M-scale to evaluate the prediction quality, whereas the approach 


of Kudraszow and Maronna (2011) is based on a robust estimator for the multivariate linear model. None 
of these methods, however, are sparse. 

This paper proposes a CCA method that is sparse and robust at the same time. As such, we deal with two 
important topics in applied statistics: sparse model estimation and the presence of outliers in the data. We 
consider CCA from a predictive point of view. We use an alternating robust, sparse regression framework to 
sequentially obtain the canonical variates. We obtain sparse canonical vectors that are resistant to outlying 


observations by using the sparse Least Trimmed Squares (sparse LTS) estimator of Alfons et al. (2013). We 
show that the Robust Sparse CCA method has a clear advantage over standard CCA and Sparse CCA by 
means of a simulation study. The advantage of Robust Sparse CCA over Robust CCA is twofold. First, 
Robust Sparse CCA provides well interpretable canonical vectors since some of the elements of the canonical 
vectors are estimated as exactly zero. Secondly, when the number of variables becomes large compared to 
the sample size, the estimation accuracy of Robust Sparse CCA will be better. Moreover, when the largest 
number of variables of both data sets exceeds the sample size. Robust Sparse CCA can still be performed. 

The remainder of this article is organized as follows. Section 2 considers the CCA problem from a 
predictive point of view. Section 3 discusses the Robust Sparse CCA algorithm. Section 4 presents simulation 
results where we compare Robust Sparse CCA to standard CCA, Robust CCA and Sparse CCA. Two real 
data examples are discussed in Section 4. Section 5 concludes. 
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2 CCA based on alternating regressions 


We consider the CCA problem from a predictive point of view, as proposed by Brillinger (1975) and Izenman 


(1975). Given a sample of n observations Xj S and yi S (i = 1,..., n). The two data matrices are 
denoted as X = [xi, ..., Xn]^ and Y = [yi, ..., yn]^- The estimated canonical vectors are collected in the 
columns of the matrices A G and B G Here r is the number of canonical vectors. The columns of 

the matrices XA and YB contain the estimates of the realizations of the canonical variates, and we denote 
their column by Uj and for 1 < 7 < r. The objective function defining the canonical vector estimates 
is 

n 

(A,B) = argmin VllA^Xi-B^yilp. (1) 

(A.B) 

The objective function in ([^ is minimized under the restriction that each canonical variate Uj is uncorrelated 
with the lower order canonical variates Uk, with 1 < k < j < r. Similarly for the canonical vectors within 
the second set of variables. For identification purpose, a normalization condition requiring the canonical 
vectors to have unit norm is added. Typically, the canonical vectors are obtained by an eigenvalue analysis 
of a certain matrix involving the inverses of sample covariance matrices. But if n < max(g,p), these inverses 
do not exist. 

We estimate the canonical vectors with an alternating regression procedure using the Least Squares 
estimator (Wold 1968). Indeed, if the matrix A in ([^ is kept fixed, the matrix B can be obtained from a 
Least Squares regression of the canonical variates on y (and vice versa for estimating A keeping B fixed). 
The Least Squares estimator, however, is not sparse, nor robust to outliers. Therefore, we replace it by the 


sparse Least Trimmed Squares (sparse LTS) estimator (Alfons et al. 2013). The sparse LTS estimator is a 
sparse version of the Least Trimmed Squares estimator. Sparse model estimates are obtained by adding an 


Li penalty to the LTS objective function, similar as for the lasso regression estimator (Tibshirani 1996). The 


sparse LTS estimator can be applied to high-dimensional data, where the sample size exceeds the number of 
variables and is therefore as well a regularized version of LTS. 

3 Robust Sparse alternating regression algorithm 

We use a sequential algorithm to derive the canonical vectors. 


First canonical vector pair. Denote the first canonical vector pair by (Ai,Bi). Assume that the value 
of Al is known. Denote the vector of squared residuals by r^(Bi) = (ri,..., r^)^, with rf = (Ai^Xi — 
Bi^yi)^, i = 1,..., n. The estimate of Bi is obtained as 

h p 


Bi|Ai = argmin^ (r2(Bi))^^^ -h Abi 

Bi ■ 1 -1 


( 2 ) 
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where > 0 is a sparsity parameter, is the element, j = 1 ,... ,p, of the first canonical vector Bi, 
and (r^(Bi))^.^ < • ■ • < (r^(Bi))^,^ are the order statistics of the squared residuals. The canonical vector 
Bi is normed to length 1. The solution to ([^ equals the sparse LTS estimator with XAi as response and Y 
as predictor. Regularization by adding a penalty term to the objective function is necessary since the design 
matrix Y can be high-dimensional. As such, we get a robust sparse estimate Bi. 

Analogously, for a fixed value Bi, denote the vector of squared residuals by r^(Ai) = (rj,..., r^)^, with 
rf = (Bi'^yi — Ai^Xi)^, z = 1,..., n. The sparse LTS regression estimate of Ai with YBi as response and 
X as predictor, is 

h q 

Ai|Bi = argmin^ + Xa, ^ |Aij|, (3) 

Ai . T • T 

where A^i > 0 is a sparsity parameter, Ai^ is the element, j = 1 ,..., 9 of the first canonical vector Ai, 
and (r 2 (Ai))^^^ < ... < (r 2 (Ai))^^^ are the order statistics of the squared residuals. The canonical vector 
Ai is normed to length 1 . 

This leads to an alternating regression scheme, updating in each step the estimates of the canonical vectors 
until convergence. We iterate until the angle between the estimated canonical vectors in two successive 
iterations is smaller than some tolerance value e (e.g. e = 10“^), and this for Ai and Bi. In the simulations 
we conducted, convergence was almost always reached. 


Higher order canonical vector pairs. We use deflated data matrices to estimate the higher order canonical 


vector pairs (see e.g. Branco et al. 20051. For the second canonical vector pair, the deflated matrices are 
X*, the residuals of a column-by-column LTS regression of X on all lower order canonical variates, Ui in 
this case; and Y*, the residuals of a column-by-column LTS regression of Y on Vi. Since these regressions 
only involve a small number of regressors, they should not be regularized. 

The second canonical variate pair is then obtained by alternating between the following regressions until 
convergence: 

h p 


B;|A* = argmin^ {B*))+ Xb; 


B? 


i=l 


where r2(B*) = (r^ ..., )^, with rf = = 1 ,..., n. 


h q 

A;|B; = argmin^ {r^{Al)).,^ + Xa- ^ \A^j\, (5) 

AS - 1 -1 

2 2=1 3—^ 

where r^(A 2 ) = (rj,...,r^)^, with i = 1,... ,n. The canonical vectors B 2 and A 2 

are both normed to length 1. 

Finally, the second canonical vector needs to be expressed as linear combinations of the columns of the 
original data matrices, and not the deflated ones. Since we want to allow for zero coefficients in these linear 
combinations, a sparse approach is needed. To obtain a sparse A 2 , we regress U 2 on X using the sparse LTS 
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estimator, yielding the fitted values U 2 = XA 2 . To obtain a sparse B 2 , we regress V 2 on Y using the sparse 
LTS estimator, yielding the fitted values V 2 = YB 2 . 

The higher order canonical variate pairs are obtained in a similar way. We perform alternating sparse LTS 
regressions as in Q and ([^, followed by a final sparse LTS step to retrieve the estimated canonical vectors 
(Ak,Bk). Note that the regressions in Q and ([^ should be regularized, since the number of regressors 
equals p or g, which could be larger than n, but sparsity is not really necessary. 


Initial value. A starting value for Ai is required to start up the algorithm. Following [Branco et al.| 


(2005), we first obtain the first principal component of Y, denoted zi. For this aim, we use the algorithm 
of Filzmoser et al. ( 2009[ ) (available in the R package pcaPP) which performs robust principal component 
analysis. We regress Zi on X using the LTS estimator, or the sparse LTS if the number of variables is larger 
than the sample size. The estimated regression coefficient matrix of this regression is used as initial value 
for Al. 


Number of canonical variates to extract. To decide on the number of canonical variates r to extract, we 


use the maximum eigenvalue ratio criterion of An et al. (2013). We apply the Robust Sparse CCA algorithm 


and calculate the robust correlations pi,..., Prmax, with rmax = min(p, q). Each pj is obtained by computing 
the correlation between Vj and Uj from the bivariate Minimum Covariance Determinant estimator with 25% 
trimming. Let kj = pj/pjj^i for j = l,...,rmax— 1. We extract r pairs of canonical variates, where 
r = argmaxjfcj. 

Reweighted sparse LTS. The sparse LTS estimator is computed with trimming proportion 25%, so size 
of the subsample h = [0.75nJ. To increase efficiency, we use a reweighting step afterwards. The reweighted 
sparse LTS is the lasso estimator computed from the observations not detected as outliers by the sparse LTS, 
i.e. having an absolute value of the standardized residuals smaller than or equal to the 98.75**' quantile of 


the standard normal distribution (see Alfons et al. 2013 for more detail). 


Choice of the sparsity parameter. The sparsity parameter controlling the penalization on the regression 


coefficient matrices is selected with the Bayesian Information Criterion (e.g. Yin and Li, 2011). We use a 
range of values for the sparsity parameter A and select the one with the lowest value of 


BICx = -2 log La + kx log(n). 


where Lx is the estimated likelihood, using sparsity parameter A and kx is the number of non-zero estimated 
regression coefficients. 
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Table 1: Simulation settings. In all settings Sxx = Ip and Syy = Iq. 


Simulation setting 


n p 


-'xy 


Sparse Low-dimensional 


100 6 


Non-Sparse Low-dimensional 


250 12 


Sparse High-dimensional 


100 100 


0.9 Oix3 

Osxl 05x3 


0.2 O.lixT 
O.liixl O.liixT 


0.452x2 02x2 

098x2 098x98 


4 Simulation Study 

We compare the performance of the Robust Sparse CCA method with (i) standard CCA, (ii) Robust CCA, 
and (iii) Sparse CCA. The same algorithm is used for all 4 estimators, for ease of comparability. Robust 
CCA uses LTS instead of sparse LTS, and corresponds to the alternating regression approach of |Branco| 


et al. (2005). Standard CCA uses LS instead of LTS, Pearson correlations for computing the canonical 


correlations, and ordinary PCA for getting the initial values. Sparse CCA is like standard CCA, but used 
the lasso instead of LS for the multiple regressions. 


Several simulation settings are considered. In the first simulation design (revised from Branco et al. 


2005), there is one canonical variate pair and the canonical vectors have a sparse structure. The canonical 


vectors are very sparse; each containing only one non-zero element. In the second design, there are two 
canonical variate pairs and the canonical vectors are non-sparse. In the third design, there are a lot of 
variables (p = 100) compared to the sample size (n = 100). There is one canonical variate pair and the 
canonical vectors are sparse. Only Sparse CCA and Robust Sparse CCA can be computed. The number of 
simulations for each setting is M = 500. 


For each setting, the following sampling distributions are considered ([Branco et al. 2005) 
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(a) No contamination. We generate data matrices X and Y according to multivariate normal distributions 
Np+q{0, S), with covariance matrices described in Table 

(b) Symmetric contamination. 90% of the data are generated from fVp_|_q(0, S), and 10% of the data are 
generated from A'p_|_g(0,9S). 

(c) Asymmetric contamination. 90% of the data are generated from Yp_|_g(0, S), and 10% of the observa¬ 
tions equal the point tr(S)l^, where tr(S) is the trace of S. 


4.1 Performance measures. The estimators are evaluated on their estimation accuracy. We compute for 
each simulation run m, with m = 1,... ,M = 500, the angle A) between the subspace spanned by 

the estimated canonical vectors (contained in the columns of A"*) and the subspace spanned by the true 
canonical vectors (contained in the columns of A)0 Analogously for the matrix B. The median angles are 
given by 


6 lMed(A,A)= med 6»'"(A™,A) and 6»Med(B,B)= med 6»’"(B'",B). 

l<m<M l<m<M 


20101 


For evaluating sparsity, we use the true positive rate and the true negative rate (e.g. [Rothman et al. 

--——m 

#{(i, j) : Aij ^0 and Ay ^ 0} 


TPR(A"",A) = . 
TNR(A'",A) = 


#{(b j) : Ay ^ 0} 

#{(b j) : Ay =0 and Ay = 0} 

#{(z,j):Ay =0} 

Analogously for the matrix B. A true positive is a coefficient that is non-zero in the true model, and is 
estimated as non-zero. A true negative is a coefficient that is zero in the true model, and is estimated as zero. 
Both the true positive rate and the true negative rate should be as high as possible for a sparse estimator. 


4.2 Results for the Sparse Low-dimensional design. Simulation results for the estimator A are in 
Table The results for the estimator B are similar and are, therefore, omitted. In the scenario without 
contamination, the results are according to the expectations. The sparse estimators Sparse CCA and Robust 
Sparse CCA achieve a much better median estimation accuracy than the non-sparse estimators CCA and 
Robust CCA. As expected, a sparse method results in increased estimation accuracy when the true canonical 
vectors have a sparse structure. Looking at sparsity recognition performance. Sparse CCA and Robust Sparse 
CCA perform equally good in retrieving the sparsity in the data generating process. 

In the contaminated simulation settings, we see that the robust estimators Robust Sparse CCA and 
Robust CCA maintain their accuracy under contamination. Robust Sparse CCA performs best. Robust 

^The (minimum) angle (9(A, A) is computed as follows. Compute the QR-decompositions A = Q^R-^ and A = QaR-A- 
Next, compute the singular value decomposition of Q~Qa = UCV^. The matrix C is diagonal with elements ci,..., c^. The 
minimum angle is given by 6{A, A) =acos(ci). 
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Table 2: Results for the Sparse Low-dimensional design. Median of the angles between the space spanned 
by the true and estimated canonical vectors; median true positive rate and true negative rate are reported 


for each method. Averages of the angles are between parentheses. 


Method 

No contamination 

Symmetric contamination 

Asymmetric contamination 


^^Med(A, A) 

TPR 

TNR 

^^Med(A, A) 

TPR 

TNR 

^*Med(A, A) 

TPR 

TNR 

CCA 

0.08 



0.16 



0.24 




(0.08) 



(O.IS) 



(0.24) 



Robust CCA 

0.10 



0.14 



0.14 




(0.11) 



(0.14) 



(0.14) 



Sparse CCA 

0.00 

1.00 

1.00 

0.15 

1.00 

0.20 

0.19 

1.00 

0.01 

(0.02) 

(0.98) 

(0.97) 

(0.15) 

(1.00) 

(0.19) 

(0.19) 

(1.00) 

(0.10) 

Robust Sparse CCA 

0.00 

1.00 

1.00 

0.01 

1.00 

0.80 

0.00 

1.00 

1.00 

(0.03) 

(0.99) 

(0.89) 

(0.03) 

(0.99) 

(0.84) 

(0.05) 

(0.98) 

(0.88) 


Table 3: Results for the Non-sparse Low-dimensional design. Median of the angles between the space 
spanned by the true and estimated canonical vectors are reported for each method. Averages of the angles 
are between parentheses. 


Method 

No contamination 

^Med(A, A) 

Symmetric contamination 

^Med(A, A) 

Asymmetric contamination 

^Med(A, A) 

CCA 

0.03 

0.09 

0.10 


(0.03) 

(0.18) 

(0.33) 

Robust CCA 

0.04 

0.04 

0.04 


(0.04) 

(0.04) 

(0.05) 

Sparse CCA 

0.03 

0.11 

0.09 


(0.03) 

(0.21) 

(0.08) 

Robust Sparse CCA 

0.04 

0.04 

0.05 


(0.04) 

(0.04) 

(0.05) 


Sparse CCA clearly outperforms Robust CCA: for instance, we have a median estimation accuracy of 0.01 
(0.00) against 0.14 (0.14) for symmetric (asymmetric) contamination, see Tablej^ The non-robust estimators 
CCA and Sparse CCA are clearly influenced by the introduction of outliers, as reflected by the much higher 
values of the median angles 0Med(A, A). Sparse CCA now performs even worse than Robust CCA. The 
non-robust estimators are most influenced in the more severe asymmetric contamination setting. Looking 
at sparsity recognition performance. Robust Sparse CCA method is hardly affected by the introduction of 
outliers. For the Sparse CCA method, especially the true negative rates is affected. 

4.3 Results for the Non-sparse Low-dimensional design. Summarizing results are in Table Note 
that the true positive rate and true negative rate are omitted since the true canonical vectors are non-sparse. 
In the situation without contamination, all methods show similar performance. The price the sparse methods 
pay is a marginally decreased estimation accuracy, as measured by the median and average angle. In the 
contaminated settings, the robust methods Robust Sparse CCA and Robust CCA perform best and show 
similar performance. 








Table 4: Results for Sparse High-dimensional design. Median of the angles between the space spanned by 
the true and estimated canonical vectors; median true positive rate and true negative rate are reported for 
each method. Averages of the angles are between parentheses. 


Method 

No contamination 

Symmetric contamination 

Asymmetric contamination 


^^Med(A, A) 

TPR 

TNR 

^^Med(A, A) 

TPR 

TNR 

^*Med(A, A) 

TPR 

TNR 

Sparse CCA 

0.06 

1.00 

0.98 

0.19 

1.00 

0.88 

0.33 

1.00 

1.00 

(0.13) 

(0.96) 

(0.97) 

(0.37) 

(0.92) 

(0.86) 

(0.34) 

(1.00) 

(0.98) 

Robust Sparse CCA 

0.08 

1.00 

0.99 

0.11 

1.00 

0.98 

0.08 

1.00 

1.00 

(0.35) 

(0.83) 

(0.98) 

(0.36) 

(0.84) 

(0.98) 

(0.33) 

(0.82) 

(0.99) 


4.4 Results for Sparse High-dimensional design. Summarizing results are in Table In this high¬ 
dimensional setting, only Sparse CCA and Robust Sparse CCA can be performed. In the scenario without 
contamination, Sparse CCA performs best. Sparse CCA is, however, closely followed by Robust Sparse 
CCA both in terms of median estimation accuracy and sparsity recognition performance. When adding 
contamination, the performance of Sparse CCA gets distorted. The median estimation accuracy of Robust 
Sparse CCA is much better than the median estimation accuracy of Sparse CCA, especially in the more 
severe asymmetric contamination setting. 

Similar conclusions can be made when looking at the average angle as a measure of estimation accuracy 
(reported in Tables to between parentheses). In sum. Robust Sparse CCA shows the best overall 
performance in this simulation study. Robust Sparse CCA combines both robustness and sparsity. Therefore, 
it performs best in sparse settings where contamination is present. In sparse non-contaminated settings, 
Robust Sparse CCA shows competitive performance to its main competitor Sparse CCA. In contaminated 
non-sparse settings. Robust Sparse CCA shows competitive performance to its main competitor Robust 
CCA. 


5 Applications 


5.1 Evaporation data set. As a first example, we use a data set from Freund (1979). Ten variables (maxi¬ 
mum, minimum and average soil temperature; maximum, minimum and average air temperature; maximum, 
minimum and average daily relative humidity; and total wind) have been measured on 46 consecutive days 
from June 6 until July 21. The aim is to find and quantify the relations between the soil temperature 
variables and the remaining variables. 


As a first inspection of the data, we use the Distance-Distance plot (Rousseeuw and van Zomeren, 1990) 


in Figure[2 The Distance-distance plot displays the robust distances versus the Mahalanobis distances. The 
vertical and horizontal lines are drawn at values equal to the square root of the 97.5% quantile of a chi- 
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Distance-Distance Piot 



Figure 1: Distance-Distance Plot for the Evaporation data 


squared distribution with 10 degrees of freedom. Points beyond those lines would be considered as outliers. 
The Distance-Distance plot reveals some outliers: objects 31 and 32, for example, are extreme outliers. This 
suggests the need for a robust CCA method. 


We use the maximum eigenvalue ration criterion (e.g. An et al. 2013), as discussed in Section 3, to decide 
on the number of canonical variate pairs to extract. For all CCA methods two canonical variate pairs are 
extracted. To compare the performance of the CCA approaches, we perform a leave-one-out cross-validation 
exercise and compute the cross-validation score 


i=l 

where and contain the estimated canonical vectors when the observation is left out of the 
estimation sample and h = [naj, with a = 1 (0% Trimming) or a = 0.9 (10% Trimming). We use trimming 
to eliminate the effect of outliers in the cross-validation score. The method that achieves the lowest cross- 
validation score has the best out-of-sample performance. Table reports the cross-validation scores for all 
methods. Robust Sparse CCA achieves the best cross-validation score. 

Table shows the estimated canonical vectors for the Robust CCA and Robust Sparse CCA method. 
By adding the penatly term, the number of non-zero coefficients is reduced from 20 to 10, improving inter- 
pretability of the canonical vectors. The price to pay for the sparseness is a slight decrease in the estimated 
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Table 5: Evaporation data set: Cross-validation score for each method. 


Method 

CV-score 

0% Trimming 

CV-score 

10% Trimming 

CCA 

1.48 

0.98 

Robust CCA 

1.26 

0.73 

Sparse CCA 

1.63 

0.68 

Robust Sparse CCA 

0.90 

0.66 


Table 6: Evaporation data set: Estimated canonical vectors using Robust CCA and Robust Sparse CCA. 


Variables \ Canonical Vectors 

Robust CCA 

1 2 

Robust Sparse CCA 

1 2 

First data set 

MAXST: Max. daily soil temperature 

-0.21 

-0.73 

0 

0.76 

MINST: Min. daily soil temperature 

-0.02 

0.67 

0 

-0.63 

AVST: Avg. daily soil temperature 

0.98 

0.13 

1 

-0.15 

Second data set 

MAXAT: Max. daily air temperature 

0.61 

0.01 

0.94 

0 

MINAT: Min. daily air temperature 

0.53 

0.75 

0.15 

-0.84 

AVAT: Avg. daily air temperature 

0.08 

-0.05 

0.17 

0 

MAXH: Max. daily relative humidity 

-0.11 

0.05 

0 

0 

MINE: Min. daily relative humidity 

0.16 

0.54 

0 

-0.53 

AVH: Avg. daily relative humidity 

-0.43 

0.33 

-0.24 

0 

WIND: Total wind, measured in miles per day 

-0.35 

0 

0 

0 

Canonical correlations 

0.89 

0.59 

0.87 

0.48 


11 



Table 7: Nutrimouse data set: Cross-validation score for each method. 


Method 

CV-score 

CV-score 


0% Trimming 

10% Trimming 

Sparse CCA 

190.89 

114.42 

Robust Sparse CCA 

78.57 

37.10 


canonical correlations (computed using the bivariate MCD estimator, see Section 3): they drop from 0.89 
to 0.87 for the first one, and from 0.59 to 0.48 for the second canonical correlation. We find this decrease 
acceptable, given the gained sparsity in the canonical vectors. The sparse structure of the canonical vectors 
facilitates interpretation. The first canonical variate in the soil temperature data set, for instance, is uniquely 
determined by the variable AVST. 


5.2 Nutrimouse data set. As a second example, we use the nutrimouse data set (publicly available in 


the R package CCA; Gonzalez et ah, 20081. Two sets of variables, i.e. gene expressions and fatty acids, are 
available for 40 mice. The first set contains expressions of 120 genes measured in liver cells. The second 
set of variables contains concentrations of 21 hepatic fatty acids (FA). Furthermore, the mice are classified 
according to two factors: genotypes (wild-type and PPARa deficient mice) and diets (reference diet “REF”, 
hydrogenated coconut oil diet “COG”, sunflower oil diet “SUN”, linseed oil diet “LIN” and fish oil diet 


“FISH”). More details on how the data were obtained can be found in Martin et al. (2007). The aim is to 
identify a small set of genes which are correlated with the fatty acids. 

In this data set, the number of experimental units (i.e. data on 40 mice) is smaller than the number of 


variables (i.e. 120 genes). Therefore, standard CCA nor robust CCA can be performed. Gonzalez et al. 


(2008) use the Canonical Ridge. The Canonical Ridge performs regularization, but the canonical vectors 
produced by the Canonical Ridge are not sparse. Robust Sparse CCA and Sparse CCA, can also be applied 
in this high-dimensional setting and produce - more interpretable - sparse canonical vectors. 

We compare the performance of the Robust Sparse CCA method to the Sparse CCA method. For both 
methods, two canonical variate pairs are extracted using the maximum eigenvalue ration criterion from 
Section 3. We perform the same leave-one-out cross-validation exercise as described in Section 5.1. Results 
are reported in Table The Robust Sparse CCA method outperforms the Sparse CCA method. The 
cross-validation scores are reduced by about 60% when using the robust method. 

Next, we discuss the estimated canonical vectors obtained using the Robust Sparse CCA method. The 
most informative biological conclusions can be drawn from (i) the selected genes in the first canonical vector 


and (ii) the selected fatty acids in the second canonical vector (for further motivation see, Martin et al. 2007 


Gonzalez et al. 2008). Therefore, we focus on these results. First, the left panel of Figure displays the 
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Figure 2: Coefficients of selected genes (left) and coefficients of selected fatty acids (right). 


coefficients of the selected genes, i.e. those genes with non-zero estimated coefficients, in the first canonical 
vector. 24 out of 120 variables are selected. The solution is very sparse, facilitating interpretation. Two 


important genes are CypSall and CARl. Martin et al. (20071 find a consistent reduction of CypSall in PPARa 
livers on the one hand, and an overexpression of CARl on the other hand. Both genes are selected and have 
among the highest (absolute) coefficients. Second, we focus on the fatty acids and discuss the right panel of 
Figure The right panel displays the coefficients of the selected fatty acids in the second canonical vector. 
12 out of 21 fatty acid variables are selected. The fatty acids are related to the effect of certain diets used 
in this experiment. The mice are classified according to a specific diet. Five diets are used, which contain 
specific concentration levels of the fatty acids C22:6n-3, C22;5n-3, C22:5n-6, C22:4n-3 and C20:5ii-3 


respectively (Martin et al., 20071. Looking at the selected fatty acids in the second canonical vector, we see 
that four out of these five variables are selected and have among the highest (absolute) coefficients. 

In sum, the strong genotype effect observed through the first canonical variate and the strong diet effect 
observed through the second canonical variate is in accordance with conclusions drawn in [Martin et al.| 


(2007). 
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6 Conclusion 


Sparse Canonical Correlation Analysis delivers interpretable canonical vectors, with some of its elements 
estimated as exactly zero. Robust Sparse CCA retains this advantage, while at the same ensuring that the 
estimation of the canonical vectors is not affected by outlying observations. 

The canonical vectors are given by the eigenvectors of two particular matrices, see for instance [Johnson 


and Wichern (1998 Chapter 10). Typically, the canonical vectors are estimated by taking the sample 


versions of those covariance matrices and computing the corresponding eigenvectors. One could think of 
estimating those covariance matrices with an estimator that is robust and sparse at the same time, and 
then, to compute the eigenvectors. This approach, however, would results in canonical vectors being non- 
sparse. To circumvent this pitfall, we reformulate the CCA problem in a regression framework. We use the 


sparse LTS estimator of Alfons et al. (2013) to obtain sparse canonical vectors that are resistant to outlying 
observations. 

A simulation study and two empirical examples show the advantages of the Robust Sparse CCA method 
over its natural benchmarks. In contaminated settings. Robust Sparse CCA outperforms standard CCA and 
Sparse CCA. Robust Sparse CCA has two important advantages over Robust CCA. First, Robust Sparse 
CCA improves model interpretation since only a limited number of variables, those corresponding to the non¬ 
zero elements of the canonical vectors, enter the estimated canonical variates (cfr. evaporation application). 
Second, if the number of variables approaches the sample size, the estimation precision of Robust CCA 
suffers. If the number of variables exceeds the sample size. Robust CCA can not even be performed. Due 
to the regularization performed by Robust Sparse CCA, Robust Sparse CCA can still be computed (cfr. 
nutrimouse application). 

Several questions are left for future research. One could think of a joint selection criterion for the number 
of canonical variate pairs and the sparsity parameter. This would, however, increase computation time 
substantially. To induce sparsity in the canonical vectors, we use a Lasso penalty. Other penalty functions 


such as the Adaptive Lasso (Zou 2006) could be considered. The Adaptive Lasso is consistent for variable 
selection, whereas the Lasso is not. Furthermore, we use a regularized version of the LTS estimator. One 
could also use a regularized version of the S-estimator or the MM-estimator to increase efficiency. Up to our 
knowledge, however, the sparse LTS is the only robust sparse regression estimator for which efficient code 


(Alfons 2014) is available. 
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